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Abstract 

Background: Characterization of genome-wide patterns of allelic variation and linkage disequilibrium can be used 
to detect reliable phenotype-genotype associations and signatures of molecular selection. However, the use of 
Sesomum indicum germplasm for breeding is limited by the lack of polymorphism data. 

Results: Here we describe the massively parallel resequencing of 29 sesame strains from 12 countries at a depth 
of > 13-fold coverage for each of the samples tested. We detected an average of 127,347 SNPs, 17,961 small InDels, 
and 9,266 structural variants per sample. The population SNP rate, population diversity (tt) and Watterson's estimator 
of segregating sites (Gw) were estimated at 8.6 x 10"^ 2.5 x 10"^ and 3.0 x 10"^ bp"\ respectively. Of these SNPs, 
23.2% were located within coding regions. Polymorphism patterns were nonrandom among gene families, with 
genes mediating interactions with the biotic or abiotic environment exhibiting high levels of polymorphism. The linkage 
disequilibrium (LD) decay distance was estimated at 150 kb, with no distinct structure observed in the population. 
Phylogenetic relationships between each of the 29 sesame strains were consistent with the hypothesis of sesame 
originating on the Indian subcontinent. In addition, we proposed novel roles for adenylate isopentenyltransferase (ITP) 
genes in determining the number of flowers per leaf axil of sesame by mediating zeatin biosynthesis. 

Conclusions: This study represents the first report of genome-wide patterns of genetic variation in sesame. The 
high LD distance and abundant polymorphisms described here increase our understanding of the forces shaping 
population-wide sequence variation in sesame and will be a valuable resource for future gene-phenotype and 
genome-wide association studies (GWAS). 
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Background 

Sesamum indicum (sesame) is an ancient crop with a 
mid-range genome size of -357 Mb, and contains high 
concentrations of oils and proteins with medicinal value. 
However, this species is prone to waterlogging, and is 
particularly susceptible to many fungal and bacterial dis- 
eases, such as stem and root rot, Fusarium wilt, powdery 
mildew and others. These biotic and abiotic stresses can 
lead to lower overall yields, with outputs strongly associ- 
ated with growth conditions. To overcome environmen- 
tal stresses and improve yields, abundant germplasm 
along with genetic information are required for plant- 
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breeding programs [1], and characterization of genome- 
wide patterns of allele variation and linkage disequilibrium 
ensure the detection of reliable phenotype-genotype 
associations and signatures of molecular selection [2]. 
India, China and Korea are the leading countries for 
sesame germplasm collection, preservation and research 
[3]. In China, -6,000 strains of sesame have been deposited 
in the National Gene Bank of China (Wuhan, medium- 
term Genebank; Beijing, long-term Genebank). In Korea, 
> 7,698 variants have been preserved in the Gene Bank of 
the Rural Development Administration (RDA) located 
in Suwon, Korea [4], and in India > 10,000 variants have 
been archived in the National Bureau of Plant Genetic 
Resources (New Delhi, India). However, few studies have 
examined the genetic diversity of the sesame germplasm 
on a genome-wide scale due to a lack of genomic informa- 
tion and an absence of suitable biomarkers [1,5-7]. 
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Sesame is the most common cultivar of the genus 
Sesamum, which contains more than 20 species of flow- 
ering plants. UnUke sesame, the majority of species in 
this genus have not been domesticated, with significant 
divergence in polyploidy levels [1]. As most of these wild 
species are native to sub-Saharan Africa, sesame was ori- 
ginally believed to have originated in Africa; however, 
domesticated sesame has since been shown to have orig- 
inated on the Indian subcontinent [8,9]. Further investi- 
gation into the evolution of sesame has been hampered 
by the absence of detailed molecular data across mul- 
tiple sesame strains. 

Completion of the sesame reference genome provides an 
essential resource for exploring the genetic variation of wild 
and domesticated S, indicum (http://www.ocri-genomics. 
org/Sinbase/). Here, we analyzed 29 resequenced sesame 
strains collected from 12 countries at a coverage of> 
13-fold. From these data, we have constructed the first 
haplotype map for sesame, which provides insight into 
the genetic diversity of sesame across multiple strains. 
These data can be used for the development of genome- 
wide association studies, and in turn facilitate the mapping 
of genes associated with both simple and complex traits. 



Results and discussion 

Phenotype diversity of resequenced sesame strains 

We manually selected 29 sesame strains for genome 
resequencing, including 6 from its presumed origin sites 
of India and Africa, 16 from China, 2 from the United 
States and 1 each from Afghanistan, the United Arab 
Emirates, Korea, Myanmar, the Philippines and Viet 
Nam (Additional file 1: Data SI). These strains exhibited 
a wide range of phenotypes, including determinate and 
indeterminate growth habits, tall and short plant height, 
early and late flowering, different seed coat color, single 
and triple flowers per leaf axil, uniculm and branching 
style, and others. The distant geographic relationships 
and wide phenotype variation made these strains an 
ideal model for exploration of the genetic diversity of 
cultivated sesame (Figure 1). 

Landscape of the genetic diversity of sesame 

To identify large-scale polymorphisms and better under- 
stand the genetic structure of the sesame germplasm, 
each of the 29 sesame strains were re-sequenced, gener- 
ating more than 120 Gb of filtered data at a coverage 
depth of > 13x for each strain (Additional file 2: Figure SI; 
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Additional file 1: Data SI). All sequence reads were 
aligned against the reference genome of "Zhongzi No. 13", 
which has an effective genome length of 274 Mb (http:// 
www.ocri-genomics.org/Sinbase/), using the BWA soft- 
ware [10]. The mapping rate across different strains varied 
from 88.8% to 95.2%, for an average of 91.4%. The map- 
ping result is consistent with that from the GATK soft- 
ware (Additional file 1: Data SI). 

Using a stringent pipeline, we identified an average of 
127,347 single nucleotide polymorphisms (SNPs) per 
strain using the SAMtools software [11], ranging from 
40,925 to 392,544 (Table 1; Additional file 2: Figure S2). 
Overall, SNP rates ranged from 1.5 x 10'^ to 14.3 x 10'^ 
respectively, with G:A, A:G, C:T and T:C substitutions be- 
ing the most common (Additional file 2: Figure S3). By 
combining SNP across all strains, we identified 2,348,008 



unique SNPs, for a population SNP rate of 8.6 x 10'^ bp'^. 
We next employed GATK software to call and calculate 
the SNP population, which resulted in generation of a 
total of 2,003,821 population SNPs. The concordance 
rates between GATK and SAMtools ranged from 80.0% 
to 89.9% with an average of 85.4% for the 16 linkage 
groups (pseudomolecular chromosomes) (Additional file 1: 
Data S2). Sanger sequencing showed that the SNP calling 
accuracy rate was -93.7% (Additional file 2: Figure S4; 
Additional file 1: Data S3). These results indicated that the 
majority of SNPs detected here were reliable. Of these 
SNPs, 25.1% were located within gene coding regions 
with 1.5%, 7.5% and 1.7% in the 5' untranslated region 
(UTR), coding sequence (CDS) and 3'UTR, respectively 
(Additional file 2: Figure S5; Additional file 1: Data S4). 
The top three SNP rates were detected in strains 20, 24 



Table 1 Summary of DNA variations in the 29 sesame strains 



No. 


Strain name 


Total bases (Gb) 


Total SNPs 
(number) 


Total SNPs 
in gene (%) 


Total InDels 
(number) 


Total InDels 
in gene (%) 


1_CHN 


Zhongzhi No.l5 


4.21 


73,716 


28.5 


11,441 


28.6 


2_CHN 


Zhongzhi No.l 1 


4.22 


52,471 


17.1 


4,495 


27.9 


3_CHN 


H98 


4.05 


123,830 


22.8 


1 7,998 


27.4 


4_CHN 


Jizhi No.l 


4.18 


107,656 


21.4 


9,931 


24.9 


5_CHN 


Jinzhi No.2 


4.09 


92,668 


20.2 


5,966 


25.6 


6_CHN 


Zhima8131 


3.81 


109,905 


37.1 


25,752 


33.9 


7_CHN 


2009-3335_2 


4.06 


208,773 


27.8 


37,602 


28.3 


8_CHN 


ZZM2541 


4.12 


62,528 


25.3 


8,128 


32.4 


9_CHN 


Yiyangbai 


4.22 


68,640 


24.2 


11,851 


23.5 


10_CHN 


Zihuaye 23 


4.20 


46,507 


17.8 


7,289 


24.3 


11_CHN 


Baizhima 


4.21 


1 76,649 


25.4 


32,028 


26.9 


12_CHN 


Zhima 


4.17 


68,617 


20.7 


11,531 


25.5 


13_CHN 


Mishuozhima 


4.17 


89,048 


21.1 


14,073 


25.1 


14_CHN 


Bahuama 


4.21 


160,011 


32.3 


32,041 


28.9 


15_CHN 


Xiangheizhi 2078 


4.17 


92,303 


19.1 


8,676 


21.8 


16_CHN 


Fuyangsilengcao 


4.07 


79,310 


16.8 


9,436 


26.5 


1 7_AFG 


0725 


4.10 


136,555 


29.2 


21,952 


34.7 


18_EGY 


L161 


4.15 


203,642 


27.6 


26,351 


28.9 


19_GUI 


Kl 


4.12 


137,736 


20.0 


12,656 


26.4 


20JND 


0847 


4.15 


392,544 


30.9 


56,594 


30.0 


21_K0R 


Shuiyuan 1 17 


4.19 


63,828 


17.4 


8,815 


22.4 


22_MOZ 


Suke No5- < 2> 


4.14 


60,018 


13.7 


7,168 


25.3 


23_MOZ 


Jasbrouk 


4.16 


61,093 


19.1 


8,929 


27.0 


24_MYA 


Miandianhei 


4.19 


253,189 


23.0 


44,538 


27.5 


25_PHI 


CLSU-1 


4.08 


200,737 


22.3 


16,553 


27.9 


26_UAE 


24-1 


4.12 


260,055 


23.1 


31,722 


27.2 


27_USA 


U.C.R/82NOINS 


4.16 


54,410 


21.3 


5,364 


27.9 


28_USA 


Youxianxing N03 


4.18 


215,692 


31.9 


26,637 


32.7 


29_VIE 


V6 


4.17 


40,925 


15.8 


5,363 


20.3 
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and 26, which originated in India, Myanmar and the 
United Arab Emirates, respectively; thus these regions 
may harbor a more diverse sesame germplasm. 

Although sesame is traditionally considered a self- 
pollinating plant, it is also subjected to crossed pollin- 
ation by insects such as butterflies and bees. This is 
consistent with the observed rate of heterozygosity ran- 
ging from 1237 to 49.30%, with an average of 25.39% 
(Additional file 2: Figure S6; Additional file 1: Data S5). 
Five of the 16 Chinese cultivars (strains 1 to 5) exhibited 
lower heterozygous rates (16.82 - 23.25%) compared to 
both mean levels and other non-Chinese strains. The 
unusually high heterozygosity rates for strains 16, 22, 
and 26 suggest that these strains are more attractive to 
insects; however, more trivial explanations, such as se- 
quencing and alignment errors, cannot be ruled out, es- 
pecially in the repeat regions. 

Population SNPs were used to calculate two com- 
monly used population genetics statistics: population di- 
versity (tt) and Watterson s estimator of segregating sites 
(Ow). The average n and 6w values across all 29 strains 



were 2.5 and 3.0 per kb, respectively, which are lower 
than that of rice [12] but higher than chickpea {Cicer 
arietinum) [13], watermelon {Citrullus lanatus) [14] and 
soybean [15] (Additional file 1: Data S6). We observed 
numerous consecutive slide windows along with the pseu- 
domolecules (LG1-LG16) that contained fewer than nor- 
mal SNPs, and in turn lower n and Ow values (Figure 2), 
indicative of an uneven distribution of genetic diversity 
along sesame pseudomolecules. We examined the genome 
for the highest and lowest polymorphic regions (blocks 
falling in the top and bottom 5% of n values) and found 
that the number of genes in the highest polymorphic re- 
gions was smaller than in the lowest regions (524 vs. 
1308) (Additional file 1: Data S7 and S8), similar to other 
species, such as chickpea [13]. Many of the genes in the 
highest polymorphic regions were related to environ- 
mental adaptability, including stress response pathways 
(Additional file 1: Data S9). These genes may offer a 
valuable resource for the study of biotic and abiotic 
stress in sesame. It is also interesting to note that des- 
pite the greater number of genes in the lowest 




Figure 2 Landscape of the genetic variation in sesame. Distribution of (A) pseudomolecules, (B) gene density (mRNA), (C) average InDel 
density, (D) population SNPs, (E) large-effect SNPs, (F) n values, (G) DNA transposon element density, and (H) retrotransposon element density 
across the sesame genome. 
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polymorphic regions, only five genes were enriched in 
two gene ontology (GO) terms, all of which were asso- 
ciated with basic biological functions; i.e., ribosome 
binding (Additional file 1: Data SIO). 

We next used the mapped reads that met all pair-end 
requirements, but contained alignment gaps in one end of 
the contig to detect short InDels (1-5 bp) in each strain. 
The overall number of InDels detected was inversely pro- 
portional to the length of the InDel (Additional file 2: 
Figure S7). The numbers identified across all 29 strains 
ranged from 4,495 - 56,594 (average = 17,961), for a 
total of 520,880 unique InDels (Additional file 1: Data 
Sll). Similar to SNPs, the distribution of InDels along 
the genome was not uniform, with high-density regions 
strongly associated with regions containing high SNP 
density (Figure 2). Among these InDels, the numbers of 
insertions and deletions were similar (48.8% vs. 51.2%, 
respectively). Homozygous InDels were found at a rate 
more than 1.5 -fold that of heterozygous InDels. Of 
these InDels, 71.7% were located in intergenic regions, 
1.5% (8,221) in CDS and 5.0% in UTRs, respectively. 

Structural variation (SV) was originally defined as in- 
sertions, deletions, DNA inversions and other sequence 
rearrangements greater than 1 kb in size [16]. With se- 
quencing now becoming routine [17], the operational 
spectrum of structural variants has widened to include 
much smaller events [18,19]. In the present study, we 
detected SVs between 10 bp and 1 Mb using the soft- 
ware package Breakdancer vl.2 [20] set to default pa- 
rameters. We found 7,220 - 12,458 SVs per strain 
(average = 9,266) across all 29 strains when compared to 
the reference genome (Additional file 1: Data SI 2). For 
these SVs, deletion events outnumbered insertions at a 
rate of nearly two to one (Additional file 2: Figure S8). 
Outside of InDels, the rates of other SVs, including 
DNA inversion, intrachromosomal translocation and in- 
terchromosomal translocation were relatively low, ran- 
ging from 739 to 2,360 (average = 1,140). The majority 
of SVs were between 100 - 1000 bp in size, with longer 
variations (>1 kb) less abundant, especially those longer 
than 10 kb (Additional file 2: Figure S9), consistent with 
that seen in sorghum [21]. 

The effect of variations on genes 

DNA sequence changes within genes plays a pivotal role 
in morphology and plant evolution. Of the 27,148 anno- 
tated genes in sesame genome (http://www.ocri-genom- 
ics.org/Sinbase/), 74.8% (20,311) were found to contain 
one or more SNPs in comparison to the reference gen- 
ome. Furthermore, 62.6% (16,997), 15.5% (4,218), and 
18.0% (4,892) of genes contained SNPs in their CDS, 5' 
UTRs, and 3 'UTRs, respectively. These genes were cate- 
gorized into 43 molecular function groups, with 30% 
associated with the terms protein binding, hydrolase 



activity and ATP binding; however, all genes with pre- 
dicted hydrolase activity contained SNPs only within 
their CDS regions (Additional file 2: Figure SIO). Fur- 
ther analysis identified 258 genes with SNPs in their 
CDS regions that were significantly enriched {P < 0.01) 
for the biological processes cell death and apoptotic 
process (Additional file 1: Data S13). The 136,130 non- 
synonymous and 142,103 synonymous SNPs identified 
in coding regions represent a non-synonymous-to-syn- 
onymous substitution ratio of 0.99 (Additional file 1: 
Data S4; Additional file 2: Figure Sll), similar to that of 
sorghum (1.0) [22], but higher than that oi Arab idop sis 
thaliana (0.83) [23] and lower than that of soybean 
(1.38) [15] and rice (1.2) [24]. GO term enrichment for 
genes with non-synonymous SNPs were strongly associ- 
ated with cell death, apoptosis, and defense response 
(Additional file 1: Data S14), particularly those genes en- 
coding disease resistance proteins, UDP-glucosyltransferase 
or the proteins containing leucine-rich repeats and NB- 
ARC domains (Additional file 2: Figure SI 2; Additional 
file 1: Data SI 5). These results are indicative of a higher 
rate of mutation in genes involved in biotic stress re- 
sponses, consistent with the theory that plant-pathogen 
interactions result in the diversification of pathogen- 
associated molecular pattern recognition receptors [25,26]. 

Coding region SNPs located in key structural locations 
can lead to significant changes in protein morphology, 
and in turn cause changes in overall protein function. 
Within the 29 sesame strains examined, we identified 
1,281 SNPs associated with the formation of premature 
stop codons and 246 stop codon to non-stop codon mu- 
tations. Start codon to non-start codon mutations were 
observed in 186 genes, along with an additional 404 
splice site mutations (Additional file 2: Figure SI 3). Most 
of these large-effect SNPs were located on the proximal 
ends of the pseudomolecules (LG) (Figure 2). Annota- 
tion of these four large-effect SNPs categories revealed 
different patterns of functional enrichment. For ex- 
ample, start codon to non-start codon mutations were 
found primarily in genes involved in transport, apop- 
tosis, and defense response, while splice site mutations 
were more common in genes associated with cellular me- 
tabolism, oxidation-reduction, organic substance metabol- 
ism, and nitrogen compound metabolism (Additional 
file 2: Figure S14). Among the four types of large-effect 
SNPs, premature stop codons were particularly inter- 
esting, as these mutations are often associated with loss 
of function. The majority of the mutations were found 
in genes associated with the GO biological processes 
related to adversity, including cell death, apoptosis, and 
defense response (Additional file 1: Data SI 6). 

Despite the fact that most SNPs were detected in CDS 
regions, CDS regions accounted for only 14.3% of the 
12,651 InDel mutations, lower than both the 5' and 3' 
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UTRs (18.4% and 19.6%, respectively). The number of 
genes containing InDels in the 5' and 3' UTRs de- 
creased as InDel size increased from 1 to 5 bp, but the 
CDS InDels increased significantly in 3-bp InDels, simi- 
lar to that observed in sorghum [21] (Additional file 2: 
Figure SI 5). This enrichment of 3-bp InDels is to be ex- 
pected, as InDels that are not multiples of 3 bp result in 
frame shifts and are usually fatal. Finally, we analyzed 
the distribution of InDels on the basis of gene ontology, 
and found they were similar to SNPs resulting in prema- 
ture stop codons, with statistically significant enrichment 
(P < 0.001) in genes involved in cell death, apoptosis, and 
defense response (Additional file 1: Data SI 7). 

Genetic relationships among the 29 sesame strains tested 

When considering whether sesame was first cultivated in 
Africa or on the Indian subcontinent [8,9], it is import- 
ant to investigate the effects of geography on sesame 
genetic diversity. A phylogenetic tree containing all 29 
sesame strains was constructed using the neighbor- 
joining method. This analysis revealed the highest de- 
grees of relatedness among the Chinese strains, with 
strains originating in other countries spread throughout 
(Figure 3a). This interwoven nature of sesame strains de- 
rived from different geographic locations was also evi- 
dent based on principal component analysis (PCA) 
(Figure 3b and c). Indistinct groups were observed using 



the Bayesian clustering software STRUCTURE [27] with 
K changing progressively from 2-5 (Figure 4a). 

As this study did not include any relatives or wild spe- 
cies of sesame, definitive conclusions regarding the origins 
of sesame are not possible. However, the phylogenetic re- 
lationships observed among the 29 sesame strains shed 
some light on the evolution of sesame. The three strains 
from India, Myanmar, and the United Arab Emirates ex- 
hibited higher genetic distances relative to the other 
strains (Figure 3a). According to the Vavilov center of di- 
versity theory, which states that richer genetic diversity is 
observed in the location where a plant was first domesti- 
cated [28], these results suggest that sesame originated on 
the Indian subcontinent. 

High linkage disequilibrium in sesame 

LD patterns are necessary to determine mapping resolution 
when designing association studies [29,30] and interpreting 
association peaks [31]. To estimate the LD of sesame, we 
calculated r^ between pairs of SNPs using Haploview [32] 
and found that it decayed to -0.15 from an initial value of 
0.30 over the course of -150 kb (Figure 4b and c). The LD 
decay estimate of sesame is comparable to that of self- 
pollinated soybean (-150 kb) [15], but higher than that 
seen in A. thaliana (~4 kb) [29], indica rice (-65 kb) [12] 
and foxtail millet (-100 kb). It was also significantly higher 
than that of cross-pollinated plants such as sorghum (1 kb) 




# 3 CHN 



+ China 
+ Other 



0.2 0.3 
Eigenvector 1 



+ China 
+ Other 



-0.2 
Eigenvec 



Figure 3 Genetic relationships among the 29 sesame strains tested, (a) Neighbor-joining (NJ) tree analysis of 29 sesame strains based on 
population SNPs. (b, c) PCA results for the first four statistically significant components. 
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0<ataoo^KI>p> 




Figure 4 Population structure and linlcage disequilibrium of sesame, (a) Structure analysis of 29 sesame strains based on wliole-genome se- 
quencing, (b) LD decay determined using squared correlations of allele frequencies (r^) against distance in sesame, (c) LD of sesame, shown using 
a slide window of 100 kb. Red and white spots indicate strong (r^= 1) and weak (r^ = 0) LD, respectively. 



[33] and maize (<1 kb) [34]. The high LD of sesame 
makes it not only a good plant for studying the effects 
of extreme LD in genomic and population structures 
[15], but also suitable for GWAS with relatively few 
polymorphic markers. 

Bulked segregant analysis for the candidate sites of the 
number of flowers per leaf axil in sesame 

Bulked segregant analysis (BSA) is a rapid method that 
allows for the detection of markers in specific genomic 
regions [35] and has been successfully applied to detect 
quantitative trait loci (QTL) or genes for various traits 
in rice [36], maize [37], and wheat [38]. In combination 
with high-throughput sequencing technology, BSA has 
been used to identify a novel xylose utilization gene 
from Saccharomyces cerevisiae. Here, we employed the 
BSA method to explore candidate genes that may be re- 
sponsible for the number of flowers per leaf axil. This 
phenotype is an important agronomic trait in sesame as 
it plays a role in the predicted yield. The 29 sesame 
strains were classified into two groups based on mono- 
flower versus triple-flower (13 versus 16) (Additional file 1: 
Data SI). We identified 695 genes with coincident SNPs 
between the two pools. Of these genes, 181, 21 and 31 
contained SNPs in the CDS, 5'UTR and 3'UTR, re- 
spectively (Additional file 1: Data S18). GO term 



annotation associated these genes predominantly with 
ATP binding, zinc ion binding, nucleic acid binding 
and heat shock protein binding. Of particular interest 
were six adenylate isopentenyltransferase (ITP) homo- 
logs (SIN_1002735; SIN_1000260; SIN_1000476; SIN_ 
1000477; SIN_1016115 and SIN_1001679), which were 
significantly enriched in the zeatin biosynthesis path- 
way (Figure 5). Zeatin is a member of the phytohor- 
mone family of cytokinins, which is known to be 
involved in a variety of processes associated with the 
growth and development of plants, including promotion 
of lateral bud growth and stimulation of cell division to 
produce bushier plants [39,40]. The present results sug- 
gest that ITP genes may also play a role in the number of 
flowers per leaf axil of sesame by mediating zeatin biosyn- 
thesis. However, further studies using transgenic models 
or two-parent crossing populations are required. 

Conclusions 

Next-generation sequencing is rapidly increasing our un- 
derstanding of genetic variation in crop plants [41]. This 
study provides the first comprehensive resequencing 
analysis of the high oil crop sesame. The availability of 
these data, generated from 29 strains originating from 
12 countries, provides insight into genetic variation of 
the sesame germplasm genome and facilitates a broad 
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Terpenoid backbone biosynthesis 




Isopentenyl-ATP 

— I — 

Isopentenyl-ADP 



Isopentenyl-AMP 



CYP735A 




Isopenten^ 


^l-adenosine 


y 


f 


Isopentenyl-adenine 



Trans-zeatin 

riboside 
triphosphate 



[ Trans-zeatin 
' riboside 
. diphosphate 



Trans-zeatin 
riboside 
monophosphate 

1 



Trans-zeatin 
riboside 



Figure 5 Positions of tiie six ITP homologs in the zeatin biosynthesis pathway. DMAPP: Dimethylallyl pyrophosphate; CY735A: Cytokinin 
trans-hydroxylase; ATP: Adenosine triphosphate; ADP: Adenosine diphosphate; AMP: Adenosine monophosphate. 



range of functional and evolutionary studies including 
on genome evolution, population genetics, marker- 
assisted breeding and gene identification. The identifica- 
tion of high LD in the sesame genome indicates that 
marker-assisted breeding is a better choice for sesame 
improvement. The data presented here provide new evi- 
dence supporting the hypothesis that sesame originated 
on the Indian subcontinent. In both coding and noncod- 
ing regions, we identified hundreds of thousands of poly- 
morphisms, which provide an important resource for 
both evolutionary genetic and functional studies. Of par- 
ticular interest are genes harboring non- synonymous 
mutations, including large-effect SNPs, which are likely 
to mediate interactions with the environment. This study 
also suggested that the ITP genes might play a role in 
determining the number of flowers per leaf axil of ses- 
ame. However, further studies are required to fully 
understand the functional relevance of the genetic varia- 
tions identified in this study. 

Methods 

Twenty-nine cultivated sesame strains were selected for 
genome resequencing, including 16 from China, 2 from 
the United States, and 1 each from Afghanistan, the 
United Arab Emirates, Korea, Myanmar, the Philippines, 
and Viet Nam. 

Library construction and sequencing 

Genomic DNA was extracted from fresh and etiolated 
leaves of each strain using the CTAB method. Paired- 



end sequencing libraries with inset sizes of -500 bp were 
constructed for each strain according to the manufac- 
turers instructions (Illumina) using 5-(ig genomic DNA. 
Sequencing was performed using the Illumina Hiseq 
2000 platform. Raw sequencing reads were then sub- 
jected to a series of stringent filtering steps, removing 
reads based upon the following criteria: 

Type (1): Reads with > 10% and > 3% unidentified nucle- 
otides for short and long insert size libraries, respectively. 

Type (2): Reads having > 40% of bases with quality 
scores < 7. 

Type (3): Reads of > 10 bp aligned to the adapter se- 
quence, allowing < 2-bp mismatches. 

Type (4): Paired-end reads that overlapped > 10 bp 
with the corresponding paired end. 

Type (5): Readl and read2 of two paired-end reads 
that were completely identical (considered to be prod- 
ucts of PGR duplication). 

A total of > 120 Gb was generated following all filtering 
steps, at a depth of > 13-fold (Additional file 1: Data SI). 

SNP calling 

Reads were mapped to the assembled sesame genome of 
"Zhongzhi No.l3" using BWA software [10]. The detailed 
parameters used were as follows: 

"bwa aln -m 200000 -o 1 -e 30 -i 15 -1 35-L -I -t 4 -n 
0.04 -R 20 -f 

"bwa sampe -a 800" 

Gonsidering all strains as a group, we used the SAM- 
tools function "mpileup" [11] to detect raw population 
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SNPs using reads with a mapping quality > 20. The pa- 
rameters used were as follows: 

"samtools mpileup -uf -b -D| bcftools view -bvcgl -p 
0.99" 

Using the SAMtools program "vcfutils", SNPs extracted 
using the above process were first filtered to yield sequen- 
cing depths between 30 and 581. The parameters used 
were as follows: 

"perl vcfutils.pl varFilter -d 30 -D 581" 

Raw SNP sites were further filtered based on the fol- 
lowing criteria: copy number < 2, and a minimum of 
5 bp apart, with the exception of minor allele frequen- 
cies (MAP > 0.05) where SNPs were retained when the 
distance between SNPs was < 5 bp. The diversity param- 
eters 71 and 6^ were measured using a window of 10 kb 
with a sliding window of 1 kb [12,14]. 

To check the SNP calling accuracy of SAMtools, four 
fragments ranging in size from 4.5 to 8.1 kb were ran- 
domly selected and amplified using overlapped primers, 
and the resulting PGR products subjected to Sanger se- 
quencing. The concordance rates of SNPs detected by 
the two methods ranged from 92.3 to 95.2 (average = 
93.7%) (Additional file 2: Figure S4; Additional file 1: 
Data S3). 

In addition, the GATK toolkit [42] was also used to 
call SNPs, as follows: 

We first mapped clean reads to the sesame genome 
using the BWA software with the following parameters: 

"bwa aln -m 200000 -o 1 -e 30 -i 15 -1 35-L -I -t 4 -n 
0.04 -R 20 -f 

"bwa sampe -a 800". 

SAMtools was used to split, sort, rmdup and merge 
the SAM aligned result, and picard-tools was used to 
sort the bam result and was marked as the duplicate. 
Next, we used the GATK program to realign and filter 
SNPs from the unified genotyper raw VGF using the 
parameters: 

java -jar GenomeAnalysisTK.jar -T SelectVariants -R - 
variant -concordance -o 

java -jar GenomeAnalysisTK.jar -T VariantFiltration - 
R -filterExpression "QD < 20.0 1 1 ReadPosRankSum < -8.0 
II FS>10.0 II QUAL < $MEANQUAL" -filterName 
LowQualFilter -missingValuesInExpressionsShouldEvaluat 
AsFailing -variant -logging_level ERROR -o 

java -jar GenomeAnalysisTK.jar -T CombineVariants - 
R -V samplel.vcf -V sample2.vcf -genotypeMergeOptions 
UNIQUIFY -o ^ 

A total of 2,003,821 population SNPs were obtained 
from the 16 linkage groups using GATK. 

Short InDel detection 

Using the default parameters of the software SOAPInDel 
[43], primary short insertions or deletions up to 5 bp 
were extracted based on the mapped reads that meet the 



pair-end requirements and contain alignment gaps, with 
all gaps supported by at least three non-redundant paired- 
end reads. Primary InDel sets were then filtered to include 
read quality values > 20 and InDels < 5 bp away. 

Structure variation detection 

According to the principal of paired-end sequencing, one 
of the paired-end reads should be aligned to the forward se- 
quence, while the other is aligned to the reverse sequence. 
The distance between the two aligned positions should be 
in accordance with the insert size. Thus two paired-end 
reads aligned to the genome should have normal direction 
and appropriate span. Abnormal paired-end alignments 
were analyzed by clustering and comparing with the types 
of structure variation previously defined using the software 
Breakdancer [20] run using the default parameters. The 
resulting SV dataset included INS (insertions), DEL (dele- 
tions), ITX (intrachromosomal translocations), INV (inver- 
sions) and CTX (interchromosomal translocations) ranging 
from 10 bp to 1 Mb. 

Calculation of linkage disequilibrium 

To measure LD in the population, we calculated the 
correlation coefficient (r^) of alleles using the software 
Haploview [32], as follows: 

(1) Ped and info files were generated as input files. 

(2) For each chromosome, such as LGl, the parameters 
were set as "java -jar haploview.jar -n -log LGl.log 
-pedfile LGl.genotype.ped -info LGl.genotype.info 
-dprime -minGeno 0.6 -minMAF 0.01 -hwcutoff 
0.001 -memory 2000 -maxdistance 500". 

(3) Curves were then plotted with R scripts, which draw 
averaged (r^) against pair wise marker distances. 

Population genetics analysis 

The diversity parameters tt and 6^^, were measured using 
a window of 10 kb with a sliding window of 1 kb 
[12,14]. The top and bottom 5% blocks based upon tt 
value were extracted, and the genes in these blocks de- 
fined as high- and low- divergence genes, respectively 
(Additional file 1: Data S7 and S8). 

Individual SNPs were used to calculate distances be- 
tween samples. Under the p-distances model with boot- 
strapping (1,000), a neighbor-joining tree was constructed 
with TreeBest (http://sourceforge.net/projects/treesoft/files/ 
treebest/) for the 29 sesame strains. The phylogenetic 
tree was displayed using the software MEGA5 [44]. was 
performed using the software EIGENSOFT [45]. The soft- 
ware FRAPPE [46] was used to determine the population 
structure. 
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